
chrisparrinello
Contributor

Added the es.search.coordinator.phases.query.duration.histogram APM metric to track the duration of the search query phase at the coordinator level.

chrisparrinello requested a review from a team as a code owner October 6, 2025 20:47
chrisparrinello requested review from javanna and smalyshev and removed request for a team and javanna October 6, 2025 20:47
elasticsearchmachine added the v9.3.0 and needs:triage labels Oct 6, 2025
chrisparrinello added the >enhancement, Team:Search Foundations, and :Search Foundations/Search labels and removed the needs:triage label Oct 6, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

Member

@javanna javanna left a comment

I left some comments, thanks!


@Override
protected void doRun(Map<SearchShardIterator, Integer> shardIndexMap) {
    phaseStartTimeNanos = System.nanoTime();
Member

This could be potentially streamlined into the run method in the parent class. We may be able to even report the latency at the coordinator in a generic manner, with all the code in AbstractSearchAsyncAction?
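
A minimal sketch of what that suggestion could look like, assuming a template-method shape in the parent class. This is not the PR's code: `TimedSearchPhase`, `DemoQueryPhase`, `onPhaseDone`, the injected clock and recorder, and the choice of milliseconds as the unit are all hypothetical stand-ins for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical base class standing in for the parent of the per-phase actions.
abstract class TimedSearchPhase {
    private final LongSupplier nanoClock;        // e.g. System::nanoTime
    private final LongConsumer durationRecorder; // e.g. a histogram's record method
    private long phaseStartTimeNanos;

    TimedSearchPhase(LongSupplier nanoClock, LongConsumer durationRecorder) {
        this.nanoClock = nanoClock;
        this.durationRecorder = durationRecorder;
    }

    // Template method: the start time is captured once here, so subclasses
    // no longer need their own timing line at the top of doRun().
    final void run() {
        phaseStartTimeNanos = nanoClock.getAsLong();
        doRun();
    }

    protected abstract void doRun();

    // Called when the phase finishes; reports the elapsed time in milliseconds
    // (the unit is an assumption of this sketch).
    protected final void onPhaseDone() {
        long elapsedNanos = nanoClock.getAsLong() - phaseStartTimeNanos;
        durationRecorder.accept(TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
    }
}

// Hypothetical concrete phase. Real code would fan out shard requests and
// call onPhaseDone() from the completion callback instead of inline.
final class DemoQueryPhase extends TimedSearchPhase {
    DemoQueryPhase(LongSupplier nanoClock, LongConsumer durationRecorder) {
        super(nanoClock, durationRecorder);
    }

    @Override
    protected void doRun() {
        onPhaseDone();
    }
}
```

Injecting the clock keeps the timing logic testable without sleeping; production code would pass `System::nanoTime`.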

Contributor Author

My other PR attempted the "generic" path as well. There is some weirdness around PIT creation and queries that caused a lot of issues in CI, but I think I have an idea of where the issue was. I'll try to move this code into AbstractSearchAsyncAction, which should cover at least the DFS and query phases.

I'll have to check whether that helps fetch and the other subsequent phases. The issue with them is that they don't subclass AbstractSearchAsyncAction but reference it via a passed-in context, so those phases might not hit run to set the start time of the phase. I'll have to do some debug tracing.

Member

I see what you mean. I'd limit the change to tracking the two variations of the query phase and perhaps open point in time: the AbstractSearchAsyncAction subclasses, to be more concrete. I would not consider the other so-called search phases that important, to be honest. I care about can match, dfs, query, and fetch. We can expand further later as needed.

Contributor Author

To be clear, does "dfs" cover both the "DFS roundtrip" and the DFS-specific query implementation (i.e. basically timing SearchDfsQueryThenFetchAction)? Or do we want "DFS roundtrip", "normal query", and "DFS-specific query" in addition to the can match and fetch times?

  request.getMaxConcurrentShardRequests(),
- clusters
+ clusters,
+ coordinatorSearchPhaseAPMMetrics
Member

I am a bit surprised that we don't record latency for this. I don't want to confuse you: I don't mean reporting dfs phase latency at the coordinator. What I mean is that DFS query then fetch has an additional DFS roundtrip in the beginning, but after DFS it executes the query phase, yet the codepath is all in SearchDfsQueryThenFetchAction.

Contributor Author

Yeah, it's a separate code path if you do a DFS query. It joins the "normal" code path when you get back to the fetch and subsequent phases. In my original PR, it was reporting a "dfs" and a "dfs_query" phase duration. Not for this PR, but for the future one where we record the DFS phase metric: do we want to record a separate DFS roundtrip metric? Also, do we want to differentiate the two query code paths with two different metrics (DFS and non-DFS query phases), or record both paths with the same query phase metric?

Member

Good question. I'd keep the two variations of the query phase separate.

Contributor

@smalyshev smalyshev left a comment

Some code structure comments

Labels
>enhancement :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.3.0
4 participants